Linear regression: Inference with regression models

DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods

CS 10.1 revisited: Dungeness crab growth

Recall the model equation for the best-fit line is

\[ y_i = \beta_0 + \beta_1 \times x_i + \varepsilon_i, ~ \text{where} ~ \varepsilon_i \sim \text{Normal}(0, \sigma_\varepsilon) \]

We want to estimate the values of \(\beta_0\), \(\beta_1\), and \(\sigma_\varepsilon\) from the \(n\) observations in our data by minimising the residual sum of squares, \(SSR\)

To do this in R, we use the same command we did for one-way ANOVA: lm()

# Fit the best-fitting straight line to the data
crabs.df <- read.csv("datasets/dungeness-crabs.csv")
crabs.fit <- lm(Premolt ~ Postmolt, data = crabs.df)
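The same estimates lm() produces can be obtained by hand from the closed-form least-squares formulas. A minimal sketch on a small simulated dataset (not the crab data), checking the result against lm():

```r
# Simulate postmolt-like sizes and premolt-like sizes (hypothetical values,
# chosen only to resemble the crab data in scale)
set.seed(1)
x <- runif(50, 100, 160)
y <- -29 + 1.1 * x + rnorm(50, sd = 2)

# Closed-form least-squares estimates that minimise SSR
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# They match the coefficients lm() reports
fit <- lm(y ~ x)
c(b0, b1)
coef(fit)
```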

R output of the best-fit line

# Prints a summary of the linear regression model we 
#   fitted to crabs.df
summary(crabs.fit)

Call:
lm(formula = Premolt ~ Postmolt, data = crabs.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6485 -1.3060  0.0829  1.2683 11.0291 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -28.94743    1.55480  -18.62   <2e-16 ***
Postmolt      1.09948    0.01079  101.93   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.996 on 359 degrees of freedom
Multiple R-squared:  0.9666,    Adjusted R-squared:  0.9665 
F-statistic: 1.039e+04 on 1 and 359 DF,  p-value: < 2.2e-16

Fitted intercept term, \(\widehat{\beta}_0 = -28.94743\)

Fitted gradient term, \(\widehat{\beta}_1 = 1.09948\)

Standard error of the residuals, \(\widehat{\sigma}_\varepsilon = 1.996\)
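Rather than reading these numbers off the printout, they can be extracted with the coef() and sigma() accessor functions. A sketch on a small simulated fit (the same calls work on crabs.fit):

```r
# Fit a line to simulated data (hypothetical values, for illustration only)
set.seed(2)
d <- data.frame(x = 1:20)
d$y <- 3 + 2 * d$x + rnorm(20)
fit <- lm(y ~ x, data = d)

coef(fit)    # named vector: beta0-hat (intercept) and beta1-hat (slope)
sigma(fit)   # residual standard error, sigma-hat-epsilon
```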

Inference for β1

The best “guess” of \(\beta_1\) according to our data is denoted as \(\widehat{\beta}_1\)

By itself, it tells us the direction of the fitted linear relationship and the average change in the response variable as the explanatory variable increases by one unit


The 95% confidence interval for β1

We are 95% confident that a one-mm increase in postmolt size increases the average premolt size by somewhere between 1.08 and 1.12 mm

What if we wanted to use the model to infer the average change in premolt size if we increase the postmolt size by five mm?

# Prints the confidence interval for the betas
confint(crabs.fit, level = 0.95)
                 2.5 %     97.5 %
(Intercept) -32.005093 -25.889773
Postmolt      1.078267   1.120694
# Multiply the confidence interval by 5 units
#   (only the Postmolt row is of interest here)
5 * confint(crabs.fit, level = 0.95)
                  2.5 %      97.5 %
(Intercept) -160.025465 -129.448864
Postmolt       5.391336    5.603471
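The confint() interval for \(\beta_1\) can also be rebuilt by hand from the summary() table, as estimate ± t-quantile × standard error on the 359 residual degrees of freedom:

```r
# Reconstruct the 95% CI for beta1 from the reported estimate and SE
est <- 1.09948                 # beta1-hat from the summary() table
se  <- 0.01079                 # its standard error
tq  <- qt(0.975, df = 359)     # two-sided 95% t-quantile, about 1.97
est + c(-1, 1) * tq * se       # matches the confint() row for Postmolt
```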

The hypothesis test for β1

A p-value is also reported in the summary() output for \(\widehat{\beta}_1\). What null hypothesis is it generating evidence against, and in favour of what alternative hypothesis?

\[ \begin{aligned} H_0\!&: \beta_1 = 0 \\ H_1\!&: \beta_1 \neq 0 \end{aligned} \]

We have very strong evidence against the null hypothesis that the slope term is equal to 0, in favour of the alternative that it is not (p-value < 0.0001)

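The t value and p-value in the Postmolt row of the summary() table can likewise be reconstructed from the estimate and its standard error:

```r
# Rebuild the t statistic and two-sided p-value for beta1 by hand
t_stat <- 1.09948 / 0.01079            # estimate / standard error
p_val  <- 2 * pt(-abs(t_stat), df = 359)
t_stat                                 # about 101.9, as in the summary() table
```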

Inference for β0

The best “guess” of \(\beta_0\) according to our data is denoted as \(\widehat{\beta}_0\)

By itself, it tells us the average of the response, \(\hat{y}_i\), when the value of the explanatory variable, \(x_i\), is equal to 0. That is, the value where the best-fit line cuts the y-axis


The 95% confidence interval for β0

We estimate with 95% confidence that the mean premolt size is somewhere between -32.0 and -25.9 mm when the postmolt size is 0 mm

In practice, the interpretation of \(\beta_0\) is often not of interest, and we do not mind if it is plausibly equal to 0. However, we still need it to fit the best-fit line to our data!

# Prints the confidence interval for the betas
confint(crabs.fit, level = 0.95)
                 2.5 %     97.5 %
(Intercept) -32.005093 -25.889773
Postmolt      1.078267   1.120694

The hypothesis test for β0

A p-value is also reported in the summary() output for \(\widehat{\beta}_0\). What null hypothesis is it generating evidence against, and in favour of what alternative hypothesis?

\[ \begin{aligned} H_0\!&: \beta_0 = 0 \\ H_1\!&: \beta_0 \neq 0 \end{aligned} \]

We have very strong evidence against the null hypothesis that the intercept term is equal to 0, in favour of the alternative that it is not (p-value < 0.0001)


Inference for the linear regression fit

Like one-way ANOVA, we can decompose the “variability” of the response variable into the part described by the best-fit line and the part left in the residuals. Hence we can conduct an F-test with the following hypotheses

\[ \begin{aligned} H_0\!&: \text{The model is ineffective} \\ H_1\!&: \text{The model is effective} \end{aligned} \]

We have very strong evidence against the null hypothesis that the fitted model is ineffective in modelling a linear relationship between the premolt and postmolt sizes of crabs, in favour of the alternative that it is effective (p-value < 0.0001)

# Decompose the total "variability" of the premolt sizes
anova(crabs.fit)
Analysis of Variance Table

Response: Premolt
           Df Sum Sq Mean Sq F value    Pr(>F)    
Postmolt    1  41393   41393   10389 < 2.2e-16 ***
Residuals 359   1430       4                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The total "variability" of the premolt sizes
(nrow(crabs.df) - 1) * var(crabs.df$Premolt)
[1] 42823
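The Multiple R-squared in the summary() output falls straight out of this decomposition, as the share of the total variability described by the line:

```r
# Sums of squares from the anova() table above
ss_model <- 41393                 # described by the best-fit line
ss_resid <- 1430                  # left in the residuals
ss_total <- ss_model + ss_resid   # 42823, matching the check above
ss_model / ss_total               # about 0.9666, the Multiple R-squared
```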

CS 10.2: 24 hour exams

Simulated data based on a 24-hour open-book, online exam held at a real university

Context

In one trimester, exams were held over a 24-hour window, although students could still complete each exam within 2 hours. An exam ran from 1pm one day to 1pm the next day. The lecturers were interested in whether there would be any pattern in the exam marks depending on the final submission time within these 24 hours

Lecturers collected data from an introductory statistics paper to investigate whether such a pattern existed
Variables

Score: An integer denoting the student’s exam mark out of 80

Submit: A number denoting the number of hours after the exam was released before the student submitted their exam. This is a number between 0 and 24
submit.df <- read.csv("datasets/submission.csv")
nrow(submit.df)
[1] 2283
summary(submit.df)
     Score           Submit     
 Min.   :10.00   Min.   : 0.97  
 1st Qu.:43.00   1st Qu.: 7.10  
 Median :52.00   Median :14.28  
 Mean   :52.26   Mean   :14.64  
 3rd Qu.:62.00   3rd Qu.:22.84  
 Max.   :80.00   Max.   :24.00  

Exploring the data

# xyplot() comes from the lattice package
library(lattice)
xyplot(Score ~ Submit, data = submit.df,
       main = "Exam marks vs final submission times of students",
       xlab = "Final submission time (hours)",
       ylab = "Exam mark (out of 80)")

Figure: The exam marks and final submission times of 2283 students

Let’s quickly check the sample correlation to help describe what we see…

cor.test( ~ Score + Submit, data = submit.df)$estimate
        cor 
-0.01269911 

Fitting the best-fit line

submit.fit <- lm(Score ~ Submit, data = submit.df)
summary(submit.fit)

Call:
lm(formula = Score ~ Submit, data = submit.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.063  -9.381  -0.089   9.893  27.935 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 52.57575    0.58546  89.803   <2e-16 ***
Submit      -0.02137    0.03524  -0.607    0.544    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.22 on 2281 degrees of freedom
Multiple R-squared:  0.0001613, Adjusted R-squared:  -0.0002771 
F-statistic: 0.3679 on 1 and 2281 DF,  p-value: 0.5442
confint(submit.fit)
                  2.5 %      97.5 %
(Intercept) 51.42766870 53.72384007
Submit      -0.09047597  0.04772819
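For a simple linear regression, the Multiple R-squared is just the square of the sample correlation we computed earlier:

```r
# Squaring the cor.test() estimate recovers the Multiple R-squared
r <- -0.01269911   # sample correlation from cor.test() above
r^2                # about 0.00016, the Multiple R-squared in summary()
```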

Linear regression ⇝ The one-sample t-test for μ

We can use the lm() R function to fit only \(\beta_0\) and \(\varepsilon_i\) to \(y_i\) like so

submit.fit.2 <- lm(Score ~ 1, data = submit.df)

Score ~ 1 has a special meaning: fit a single mean (an intercept only) to the response variable…

What is the difference between the two methods…?

summary(submit.fit.2)

Call:
lm(formula = Score ~ 1, data = submit.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.263  -9.263  -0.263   9.737  27.737 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  52.2628     0.2767   188.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.22 on 2282 degrees of freedom

Comparing R output

summary(submit.fit.2)

Call:
lm(formula = Score ~ 1, data = submit.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.263  -9.263  -0.263   9.737  27.737 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  52.2628     0.2767   188.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.22 on 2282 degrees of freedom
confint(submit.fit.2)
               2.5 %   97.5 %
(Intercept) 51.72024 52.80539
t.test(Score ~ 1, data = submit.df)

    One Sample t-test

data:  Score
t = 188.89, df = 2282, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 51.72024 52.80539
sample estimates:
mean of x 
 52.26281
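The equivalence can be checked directly: the intercept of an intercept-only lm() fit is the sample mean, and its confidence interval matches the one from t.test(). A minimal sketch on simulated marks (not the real submission data):

```r
# Simulate exam-like marks (hypothetical values, for illustration only)
set.seed(3)
marks <- round(rnorm(100, mean = 52, sd = 13))

# Intercept-only fit: lm(y ~ 1) estimates a single mean
fit0 <- lm(marks ~ 1)

coef(fit0)      # equals mean(marks)
confint(fit0)   # equals the t.test(marks) confidence interval
```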